11th May 2021

Outline

Introduction

Materials

  • Description of data set

Methods

  • Data cleaning and wrangling
  • Visualization
  • Modelling

Results

  • Data visualization
  • Logistic regression
  • PCA
  • K-means clustering

Discussion and Conclusion

Introduction

Introduction

  • Prostate cancer data set from Andrews DF and Herzberg AM (1985)
  • Compare four different treatments
  • 502 observations of 18 variables
  • 27 NA values

Materials

The data set

The Raw Data set

Variables in the data set, 1/2
Variable name | Description | Level / summary statistic
patno | Patient number |
rx | Dosage of estrogen treatment in mg | Placebo, 0.2 mg, 1.0 mg, 5.0 mg
status | Status of the patient: alive or cause of death | alive, dead - prostatic ca, dead - heart or vascular, dead - cerebrovascular, dead - pulmonary embolus, dead - other ca, dead - respiratory disease, dead - other specific non-ca, dead - unspecified non-ca, dead - unknown cause
pf | Activity level based on time in bed | normal activity, in bed < 50% daytime, in bed > 50% daytime, confined to bed
bm | Presence of bone metastases | Yes/No
hx | History of cardiovascular disease | Yes/No
ekg | Electrocardiography | normal, benign, rhythmic disturb & electrolyte ch, heart block or conduction def, heart strain, old MI, recent MI

The Data set

Variables in the data set, 2/2
Variable name | Description | Level / summary statistic
stage | Stage of prostate cancer | 3 or 4
age | Age of patient in years | 71.51 (48.00 ~ 89.00)
wt | Weight index = wt(kg) - ht(cm) + 200 | 98.93 (69.00 ~ 152.00)
sbp | Systolic blood pressure / 10 | 14.35 (8.00 ~ 30.00)
dbp | Diastolic blood pressure / 10 | 8.15 (4.00 ~ 18.00)
hg | Serum hemoglobin [g/100 ml] | 13.45 (5.90 ~ 21.20)
sz | Size of primary tumor [cm^2] | 14.57 (0.00 ~ 69.00)
ap | Serum prostatic acid phosphatase | 12.18 (0.01 ~ 999.88)
sdate | Date of study |
sg | Combined index of stage and hist. grade |
dtime | Months of follow-up |
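
The derived quantities in the table can be illustrated with a small worked example. This is a minimal Python sketch, not part of the original analysis: the function names and the example patient are hypothetical, and reading the blood pressure values as mmHg after rescaling is an assumption, since the table does not state the unit.

```python
def weight_index(weight_kg: float, height_cm: float) -> float:
    """Weight index as defined in the table: wt(kg) - ht(cm) + 200."""
    return weight_kg - height_cm + 200


def unscale_pressure(scaled: float) -> float:
    """Undo the '/10' scaling of sbp and dbp (unit assumed to be mmHg)."""
    return scaled * 10


# Hypothetical patient: 80 kg and 180 cm.
example_index = weight_index(80, 180)

# Mean sbp from the table, converted back to the unscaled value.
mean_sbp_unscaled = unscale_pressure(14.35)

print(example_index, mean_sbp_unscaled)
```

Under this definition a patient whose weight in kg equals height in cm minus 100 has an index of 100, which is close to the mean of 98.93 reported in the table.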

Methods

Methods, Data cleaning and wrangling

Raw data to Clean data

  • Exclude dtime, sdate and sg
  • Renaming variables

Clean data to Augmented data

  • Add five new variables:
    • outcome = status collapsed into alive vs. dead
    • treatment_mg = numeric dose, obtained by removing the text from the treatment label
    • EKG_lvl = factor variable with the EKG levels encoded as numbers
    • performance_lvl = factor variable with the activity levels encoded as numbers
    • age_group = two age groups, young/old, split at the mean age
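
The augmentation steps above can be sketched as follows. This is an illustrative Python version under stated assumptions, not the original pipeline: the four records are made up, the level labels are taken from the variable tables, the numeric order of the activity levels is assumed, and a patient exactly at the mean age is assigned to "old" by assumption.

```python
from statistics import mean

# Hypothetical records; values are illustrative only.
patients = [
    {"rx": "Placebo", "status": "alive", "pf": "normal activity", "age": 60},
    {"rx": "0.2 mg", "status": "dead - prostatic ca", "pf": "in bed < 50% daytime", "age": 75},
    {"rx": "1.0 mg", "status": "alive", "pf": "confined to bed", "age": 70},
    {"rx": "5.0 mg", "status": "dead - other ca", "pf": "in bed > 50% daytime", "age": 83},
]

# Assumed ordering of activity levels, from least to most restricted.
PF_LEVELS = [
    "normal activity",
    "in bed < 50% daytime",
    "in bed > 50% daytime",
    "confined to bed",
]


def treatment_mg(rx: str) -> float:
    """Strip the unit text from the dose label; placebo counts as 0 mg."""
    return 0.0 if rx == "Placebo" else float(rx.split()[0])


def augment(rows):
    """Return copies of the rows with the derived variables added."""
    mean_age = mean(r["age"] for r in rows)
    return [
        {
            **r,
            "outcome": "alive" if r["status"] == "alive" else "dead",
            "treatment_mg": treatment_mg(r["rx"]),
            "performance_lvl": PF_LEVELS.index(r["pf"]) + 1,
            "age_group": "young" if r["age"] < mean_age else "old",
        }
        for r in rows
    ]


augmented = augment(patients)
for row in augmented:
    print(row["outcome"], row["treatment_mg"], row["performance_lvl"], row["age_group"])
```

EKG_lvl would follow the same pattern as performance_lvl, with the EKG levels listed in a fixed order.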

Methods

Visualization of pre-treatment variables - Numeric

Methods

Visualization of pre-treatment variables - Categorical

Methods

Visualization of pre-treatment variables - Heatmap

Methods

Modelling

  • Logistic regression
  • PCA
  • K-means clustering
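
As an illustration of the last modelling step, the sketch below implements K-means with Lloyd's algorithm in plain Python. It is a generic textbook version, not the implementation used for the slides: the points and starting centroids are made up, and the function name `kmeans` is hypothetical.

```python
import math


def kmeans(points, centroids, max_iter=100):
    """Lloyd's algorithm: alternate assignment and centroid update until stable."""
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid (Euclidean).
        labels = [
            min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its assigned points.
        new_centroids = []
        for j, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(
                    tuple(sum(coord) / len(members) for coord in zip(*members))
                )
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels, centroids


# Hypothetical two-dimensional data with two well-separated groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
labels, centroids = kmeans(points, centroids=[points[0], points[3]])
print(labels, centroids)
```

Because K-means relies on Euclidean distance, variables on different scales (for example ap and dbp) would normally be standardized before clustering.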

Results

Results

Treatment, outcome and age

Results

Logistic model - Outcome as function of treatment

Output:

log_mod_treatment
## # A tibble: 4 x 5
##   term            estimate std.error statistic     p.value
##   <chr>              <dbl>     <dbl>     <dbl>       <dbl>
## 1 (Intercept)       1.11       0.211     5.27  0.000000136
## 2 treatment_mg0.2   0.0907     0.301     0.301 0.763      
## 3 treatment_mg1    -0.807      0.280    -2.88  0.00394    
## 4 treatment_mg5    -0.0536     0.294    -0.182 0.855
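
The estimates above are on the log-odds scale. The following Python sketch, added for illustration only, converts them to odds ratios and recomputes the Wald statistic and two-sided p-value from the estimate and standard error. The numbers are copied from the output; the function name `wald_summary` is hypothetical, and the normal approximation for the p-value is an assumption about how the reported values were obtained.

```python
import math

# (estimate, std.error) pairs copied from the model output.
coefs = {
    "treatment_mg0.2": (0.0907, 0.301),
    "treatment_mg1": (-0.807, 0.280),
    "treatment_mg5": (-0.0536, 0.294),
}


def wald_summary(estimate, std_error):
    """Odds ratio, Wald z statistic, and two-sided normal p-value."""
    z = estimate / std_error
    p_value = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return {"odds_ratio": math.exp(estimate), "z": z, "p_value": p_value}


summary = {term: wald_summary(est, se) for term, (est, se) in coefs.items()}
for term, s in summary.items():
    print(f"{term}: OR={s['odds_ratio']:.3f}, z={s['z']:.2f}, p={s['p_value']:.4f}")
```

For the 1.0 mg group the odds ratio relative to placebo is about 0.45. Whether this is read as lower odds of death or of survival depends on which level of the outcome the model treats as the event, which the output alone does not show.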

Results

Logistic model for each variable - Treatment = 1.0 mg

Results

Distribution of significant variables for each outcome

Results

Principal Component Analysis
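
To make the idea behind PCA concrete, the sketch below works through the smallest possible case: two standardized variables. Their correlation matrix is [[1, r], [r, 1]], whose eigenvalues are 1 + r and 1 - r, so the share of variance captured by the first component follows directly from the correlation. This is an illustrative Python example with made-up blood pressure values, not a reproduction of the analysis on the slides.

```python
import math
from statistics import mean


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def pca_two_standardized(xs, ys):
    """PCA of two standardized variables via the 2x2 correlation matrix."""
    r = pearson_r(xs, ys)
    eigenvalues = sorted([1 + r, 1 - r], reverse=True)
    explained = [ev / 2 for ev in eigenvalues]  # total variance is 2
    return r, eigenvalues, explained


# Hypothetical sbp and dbp values (in the data set's /10 scale).
sbp = [12, 14, 15, 17, 13]
dbp = [7, 8, 9, 10, 7]
r, eigenvalues, explained = pca_two_standardized(sbp, dbp)
print(r, eigenvalues, explained)
```

The more strongly two variables are correlated, the closer the first component comes to explaining all of their joint variance, which is why correlated variables point in similar directions in a PCA loading plot.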

Results

K-means clustering

Discussion and conclusion

  • Stage 3 and 4 patients differ in tumor size and acid phosphatase levels
  • The most effective treatment is 1.0 mg estrogen
  • The significant variables are tumor size, CVD, age, and weight index
  • Based on PCA, bone_mets, tumor_size and stage are strongly correlated
  • CVD and hemoglobin are related to each other, as are dbp and sbp, and both pairs are related to weight_index
  • Therefore, causes of death could fall into three classes: general health condition, cancer, and age
  • Whether the treatment improves survival remains unclear